Description: the implementation of "The Giraffe System"

Version: 1.2.0.20210721

Group name: YYDS

Authors: Haodong Liu and Jichen Zhao

Airbnb has become a popular platform among holidaymakers and tourists for lodging and rental houses. Hosts manage their listings, and guests select the one that fits their own travel plans. We use a public Airbnb dataset for the visualisation tasks. It contains summary information and metrics for listings in New York City (NYC), New York, USA, in 2019. The data table is stored in the CSV file Airbnb_NYC_2019.csv, downloaded from the corresponding dataset page on Kaggle.

Two information visualisation "systems" have been implemented: "The Giraffe System" (hereinafter Giraffe) and "The Zebra System" (hereinafter Zebra). There are two because the visualisation tasks are defined in the same context but with different content. For example, both Giraffe and Zebra explore a task that consumes information by analysing the data, but the specifications differ. Either way, we expect both to provide general insights into the NYC listings for 2019, since the visualisation tasks should help visualise and understand the primary data features and correlations.

The Giraffe System

Importing Modules

NOTE: Please ensure that no exception is thrown in this section before executing the other sections.

Preparing Data

The dataset contains more records than Altair's MaxRows check allows by default (5000 rows), and we do not want to bypass that check. Hence, we randomly select 5000 listings as the items to investigate for demonstration purposes. Missing values are then examined for further data processing.
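
The sampling and missing-value check described above could be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the helper names `sample_listings` and `missing_counts`, the constant `MAX_ROWS`, and the fixed seed are assumptions introduced here.

```python
import pandas as pd

MAX_ROWS = 5000  # Altair's default MaxRows limit


def sample_listings(listings: pd.DataFrame, n: int = MAX_ROWS, seed: int = 0) -> pd.DataFrame:
    """Return at most n randomly chosen rows (all rows if there are fewer)."""
    if len(listings) <= n:
        return listings.copy()
    # A fixed seed (an assumption) keeps the sample reproducible between runs.
    return listings.sample(n=n, random_state=seed)


def missing_counts(listings: pd.DataFrame) -> pd.Series:
    """Count missing values per attribute, keeping only attributes that have any."""
    counts = listings.isna().sum()
    return counts[counts > 0]
```

With the CSV file in the working directory, `sample_listings(pd.read_csv("Airbnb_NYC_2019.csv"))` would produce the 5000-row frame, and `missing_counts` would list the attributes that still need attention.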

Each item (i.e., a listing) originally has 16 attributes, listed below. We keep only the attributes relevant to the visualisation tasks.

| Attribute | Description | Kept |
| --- | --- | --- |
| id | The listing ID | Yes |
| name | The listing name | No |
| host_id | The host ID | Yes |
| host_name | The host name | No |
| neighbourhood_group | One of the 5 boroughs in NYC | Yes |
| neighbourhood | One of the neighbourhoods in NYC | Yes |
| latitude | The latitude coordinate | Yes |
| longitude | The longitude coordinate | Yes |
| room_type | One of the room types defined by Airbnb | Yes |
| price | The price in US dollars for a night stay | Yes |
| minimum_nights | The minimum number of nights that a guest can book | No |
| number_of_reviews | The number of reviews | No |
| last_review | The date of the latest review | No |
| reviews_per_month | The number of reviews per month | Yes |
| calculated_host_listings_count | The number of different listings for a particular host | Yes |
| availability_365 | The number of days for which a particular listing is available in a year | No |

NOTE:

  1. The attributes name and host_name are removed. We already have unique IDs for listings and hosts, and we are not interested in their names. Dropping them also avoids any potential ethical issue.
  2. The attributes neighbourhood_group, neighbourhood, and room_type are categorical. Categorical attributes can be vital for information visualisation, so all three are kept.
  3. The attributes minimum_nights and availability_365 are removed. They depend heavily on host preferences, and we are not interested in such forward-looking data.
  4. The attribute number_of_reviews is removed. Listings could have been added at different times, so we reckon that reviews_per_month is more meaningful. It contains missing values because a listing may have no reviews, in which case we simply fill the value with 0.
  5. The attribute last_review is removed. We focus on generic trends, distributions, etc. This attribute contributes little to the visualisation tasks, since its value can be null and we have no other clear date to compare it with.
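
The notes above translate into a short cleaning step. The sketch below is one possible way to write it; the helper name `clean_listings` and the constant `DROPPED_ATTRIBUTES` are assumptions made for illustration, not names taken from the original code.

```python
import pandas as pd

# Attributes removed according to the notes above.
DROPPED_ATTRIBUTES = [
    "name", "host_name",                   # note 1
    "minimum_nights", "availability_365",  # note 3
    "number_of_reviews",                   # note 4
    "last_review",                         # note 5
]


def clean_listings(listings: pd.DataFrame) -> pd.DataFrame:
    """Drop the unused attributes and treat 'no review yet' as 0 reviews per month."""
    # drop() returns a new frame, so the input is left untouched.
    cleaned = listings.drop(columns=DROPPED_ATTRIBUTES, errors="ignore")
    cleaned["reviews_per_month"] = cleaned["reviews_per_month"].fillna(0)
    return cleaned
```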

Visualising Data

Let us first define some common variables.

Before visualisation, it is worth understanding the interactive nature of charts created using Altair. In plain English, you should take advantage of the following features.

Up to this point, Giraffe shares almost the same sections as Zebra because they are necessary preparations. The sections that follow, however, differ considerably from Zebra's, since we perform the same visualisation tasks using different design decisions. The 7 visualisation tasks are defined as follows.

| Task | Action | Specification |
| --- | --- | --- |
| #1 | Analyse and consume | Discover the number of listings by borough and room type to find a borough with the most listings and entire rooms/apartments. |
| #2 | Analyse and produce | Derive the per cent of room type by borough to compare between the 2 categories. |
| #3 | Search | Look up the number of Manhattan's neighbourhoods in the top 10 neighbourhoods by the number of listings. |
| #4 | Search | Browse the host ranking by the number of reviews per month and the number of listings to find the host ranking first in each case. |
| #5 | Search | Locate the most popular price range for each borough/room type. |
| #6 | Search | Explore the price distribution by room type. |
| #7 | Query | Identify, compare, and summarise the correlations among prices, the number of listings, boroughs, and room types. |

Bar Charts: Stacked or Grouped?

People are often interested in questions like "who has the most...?" when making comparisons, and bar charts are a good choice for answering them. However, if there are multiple categories for grouping, we should consider whether the bars need to be stacked or grouped.

NOTE:

  1. Tasks associated: #1, #2. The percentages are derived from the existing data.
  2. The design decision for Giraffe here is to use stacked bar charts to visualise the number of listings by borough and the percentage of each room type by borough.
  3. Tooltips are enabled for each bar.
  4. Legend selection is enabled.

Bar Charts: Horizontal or Vertical?

Bar charts are sometimes used to visualise rankings. Neither orientation is inherently better than the other, but which one is more suitable for a specific scenario?

NOTE:

  1. Task associated: #3.
  2. The design decision for Giraffe here is to use a horizontal bar chart to visualise the top 10 neighbourhoods by the number of listings.
  3. Tooltips are enabled for each bar.
  4. Legend selection is enabled.

Bar Charts: Ordered or Unordered?

A ranking bar chart usually consists of a categorical attribute and a quantitative attribute by which the bars can be ordered. Is it always good practice to visualise the data in a specific order?

NOTE:

  1. Task associated: #4.
  2. The design decision for Giraffe here is to use ordered bar charts to visualise the top 10 hosts by the number of reviews per month and the number of listings.
  3. Tooltips are enabled for each bar.

Relationship: Bar Charts or Line Charts?

Line charts are often preferred for visualising a relationship or trend, but bar charts are versatile too. Why not try both and compare them?

NOTE:

  1. Task associated: #5.
  2. The design decision for Giraffe here is to use bar charts to visualise the relationship between the number of listings and prices, by room type and borough.
  3. Legend selection is enabled.
  4. Scale binding is enabled.
  5. A price filter is provided.

Distribution: Box Plots or Violin Plots?

Both plots provide insights into the distribution of a quantitative attribute, and violin plots additionally show the density. That does not make violin plots better by default. In the context of distribution, which one is preferable?

NOTE:

  1. Task associated: #6.
  2. The design decision for Giraffe here is to use a box plot to visualise the primary distribution of prices by room type.
  3. Tooltips are enabled for each box.

Heatmap: Hue or Saturation?

It is incredibly convenient to generate a heatmap based on geo-location for this dataset thanks to the latitude and longitude attributes. Selecting a suitable colour scheme is vital for successful visualisation. We reckon that it is better to vary the saturation of a single hue. However, we will pretend to have forgotten this and perform the visualisation task with hue instead. XD

You live, and you learn.

NOTE:

  1. Task associated: #7. Some other charts are also created in addition to the heatmap to complete the visualisation task.
  2. The design decision for Giraffe here is to use hue to visualise the price distribution by location.
  3. Tooltips are enabled for almost all chart elements.
  4. Scale binding is enabled for the sub-chart illustrating the number of listings by price.
  5. A price filter and a borough filter are provided.